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Abstract 

We present a simple theory which uses thermo- 
dynamic parameters to predict the probabihty 
that a protein retains the wildtype structure 
after one or more random amino acid substitu- 
tions. Our theory predicts that for large num- 
bers of substitutions the probability that a pro- 
tein retains its structure will decline exponen- 
tially with the number of substitutions, with 
the severity of this decline determined by prop- 
erties of the structure. Our theory also predicts 
that a protein can gain extra robustness to the 
first few substitutions by increasing its thermo- 
dynamic stability. We validate our theory with 
simulations on lattice protein models and by 
showing that it quantitatively predicts previ- 
ously published experimental measurements on 
subtilisin and our own measurements on vari- 
ants of TEMl /3-lactamase. Our work unifies 
observations about the clustering of functional 
proteins in sequence space, and provides a basis 
for interpreting the response of proteins to sub- 
stitutions in protein engineering applications. 

Introduction 

The ability to predict a protein's tolerance to 
amino acid substitutions is of fundamental im- 
portance in understanding natural protein evo- 
lution, developing protein engineering strate- 
gies, and understanding the basis of genetic dis- 
eases. Computational and experimental stud- 
ies have demonstrated that both protein sta- 
bility and structure affect a protein's tolerance 
to substitutions. Simulations have shown that 
more stable proteins have a higher fraction of 
folded mutants ^[21 El 12 and that some struc- 
tures are encoded by more sequences than oth- 
ers (HI El E]. Experiments have demonstrated 
that proteins can be extremely tolerant to sin- 
gle substitutions; for example, 84% of single- 
residue mutants of T4 lysozyme |H1 and 65% 
of single-residue mutants of lac repressor jH] 
were scored as functional. For multiple sub- 
stitutions, the fraction of functional proteins 
decreases roughly exponentially with the num- 
ber of substitutions, although the severity of 



this decline varies among proteins jlUl llll I12j . 

Protein mutagenesis experiments have also un- 
derscored the contribution of protein stabil- 
ity to mutational tolerance by finding "global 
suppressor" substitutions that buffer a protein 
against otherwise deleterious substitutions by 
increasing its stability ^1 . 

We unify these diverse experimental and 
computational results into a simple framework 
for predicting a protein's tolerance to substitu- 
tions. A fundamental measure of this tolerance 
is the fraction of proteins retaining the wild- 
type structure after a single random substitu- 
tion, often called the neutrality ^Hl- We ex- 
tend this concept to multiple substitutions by 
defining the m-neutrality as the fraction of pro- 
teins that fold to the wildtype structure among 
all sequences that differ from the wildtype se- 
quence at m residues. Since mutants that fail 
to fold also generally fail to function, the m- 
neutrality provides an upper bound to the frac- 
tion of proteins with m substitutions that re- 
tain biochemical function. We show that a pro- 
tein's m-neutrality can be accurately predicted 
from measurable thermodynamic parameters, 
and that these predictions capture the contri- 
butions of both stability and structure to deter- 
mining a protein's tolerance to substitutions. 

Methods 

Lattice Protein Model 

Wc performed simulations with lattice proteins |16| 
of length L — 20 monomers of 20 types correspond- 
ing to the natural amino acids. We folded the pro- 
teins on a two-dimensional lattice, allowing them to 
occupy any of the 41,889,578 possible compact or 
non-compact conformations. Tlie energy of a con- 
formation C is the sum of the nonbonded nearest- 
neighbor interactions, 

L i-2 
i=l j=l 

where (C) is one if residues i and j are nearest 
neighbors in conformation C and zero otlierwise, and 
e (Ai , Aj ) is the interaction energy between residue 
types Ai and Aj , given by Table 5 of ^Tj . 

The primary advantage of using lattice proteins 
is that we can exactly compute the stability of a 
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conformation Ct as 

AG/ (Ct) = E {Ct)+T\n {Q (T) - exp [-E {Ct) /T]} , 
where Q (T) is the partition sum 

QiT)^J2 o^p[~E{C.)/T] 

over all conformations, made tractable by noting 
that there are only 910,972 unique contact sets. All 
simulations were performed at a reduced tempera- 
ture of T = 1.0 

TEMl /3-Lactamase Mutant Libraries 

To examine the effects of mutations on the re- 
tention of protein function, we constructed mu- 
tant libraries of wildtype and the thermostable 
M182T variant of TEMl /3-lactamase. The 861 
bp genes (a kind gift from Brian Shoichet |18p 
were subcloned into the pM0N:lA2 plasmid [T!^ 
with Sad and Hindlll using PGR primers 5'- 
GCGGCGGAGCTCATGAGTATTCAACATTTCCGT 
GTCGC-3' and 5'-GCGGCG AAGCTT TTACCAATG 
CTTAATCAGTGAGGCAC-3' (restriction sites are 
underlined). We first created a control unmutated 
library by cutting the gene directly from the plas- 
mid. This unmutated gene was used as the template 
for a round of error-prone PGR with 100 fA reac- 
tions containing 3 ng of template, 0.5 /iM of each 
of the above primers, 7 niM MgGl2, 75 fiM MnGl2, 
200 nM of dATP and dGTP, 500 ^iM of dTTP and 
dGTP, IX Apphed Biosystems PGR buffer with- 
out MgGl2, and 5 U of Applied Biosystems Taq 
DNA polymerase. The PGR conditions were 95°C 
for 5 minutes, and then 14 cycles of 30 s each at 
95°C, 50°G, and 72°C. The product from this PGR 
was digested with Sad / Hindi I I and gel purified, 
and then used as the template for another identi- 
cal round of error-prone PGR. This process was re- 
peated to create five libraries with increasing num- 
bers of mutations, which we labeled EP-0 (for the 
unmutated control) to EP-5 (for the product of the 
fifth round of error-prone PGR). We quantified the 
number of doublings for each round by running PGR 
product versus a known standard on an agarose gel, 
and found that our protocol consistently yielded ten 
doublings. 

To measure the fraction of genes in the mutant li- 
braries that still encoded functional proteins, we lig- 
ated the genes into the pM0N:lA2 plasmid with T4 
Quick DNA Ligase in 20 jA reactions containing 50 
ng each of gene and plasmid, and then transformed 
5 /il of the ligation reactions into 50 /il of XLl-Blue 



TEMl mutation frequencies. 



Base pairs sequenced 


22,800 


Total mutations 


172 


Total AA substitutions 


120 


Mutation frequency (%) 


0.75 ± 0.06 


IViutations per gene 


6.5 ± 0.5 


AA substitutions per gene 


4.5 ± 0.4 


Mutation types (%) 
A ^T, T ^A 
A ^C, T 
A ^G, T 
G ^A, C 
G ^C, C ^G 
G C ^A 
frameshift 


22 
9 

42 
20 
1 
3 
3 



Table 1: Mutation frequencies for TEMl (3- 
lactamase mutagenesis determined by sequenc- 
ing 20 unselected clones each from the round 
five wildtype and M182T error-prone PGR li- 
braries. 



Supercompetent cells from Stratagene. The trans- 
formed cells were plated on LB-agar plates contain- 
ing 10 /Ug/ml of kanamycin (selective only for plas- 
mid) and on LB-agar plates containing 10 /ig/ml of 
kanamycin and 20 ^g/ml of ampicillin (selective for 
both plasmid and active TEMl gene) at a density 
that gave 100-300 colonies per unselected plate. The 
fractions functional were computed as the average 
of at least five pairs of selected/unselected plates, 
and are shown in Table 2. 

To test the ability of our theory to predict the 
decline in m-neutrality. 

The mutation frequency in the round five library 
was determined by sequencing the first 570 bp of 
twenty genes each from the unselected wildtype 
and M182T plates with the sequencing primer 5'- 
GGTCGATGTTTGATGTTATGGAGC-3'. The wild- 
type and M182T genes were mutated under identi- 
cal conditions, and the sequencing found the same 
nucleotide mutation frequencies for both (0.77 ± 
0.08% for wildtype and 0.74 ± 0.08% for M18T2, 
corresponding to 6.6 ± 0.7 and 6.4 ± 0.7 nucleotide 
mutations per 861 bp gene). For better statistics, 
the sequencing results for both libraries were com- 
bined to give the data in Table 1. No biases in the 
locations of the mutations were observed. Eleven 
mutations occurred twice, which is in good agree- 
ment with the expectation of eight duplicate mu- 
tations if all possible mutations were equiprobable. 
The per-round mutation frequency was calculated 
as 0.15 ± 0.03% (1.3 ± 0.3 nucleotide mutations per 
gene) by assuming that each round of error-prone 
PGR introduced the same average number of mu- 
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tations. To confirm this assumption, we sequenced 
ten unselected clones each from the wildtype and 
M182T round one Ubraries, and found mutation fre- 
quencies of 0.16 ± 0.05% for wildtype and 0.19 ± 
0.06% for M182T. Standard errors were computed 
assuming Poisson sampling statistics. More detailed 
sequencing information is in Table 4 of the Support- 
ing Information. 

Results 

Thermodynamic Framework for Predict- 
ing Neutrality 

A protein's native structure is thermodynam- 
ically stable |2U1 I21j . with typical free ener- 
gies of folding (AGj) between —5 and —15 
kcal/mol A mutant sequence folds to 

the wildtype structure only if the stability of 
that structure meets some minimal threshold. 
We call the extra stability of the native struc- 
ture beyond this minimal threshold AGJ^^'^^ 
and note that functional proteins always have 
^Qextra. ^ q_ define protein's m-neutrality 
as the fraction of sequences with m substitu- 
tions that still meet the stability threshold. 
A substitution causes a stability change of 

AAG = AG'p^ - AGf 

where AG"* and AG™"* are the wildtype and 
mutant protein stabilities. Substitutions tend 
to be destabilizing: although there are no large 
collections of AAG measurements for truly ran- 
dom substitutions, in a likely-biased collection 
of more than 2,000 measured AAG values for 
single-residue substitutions [^S], the mean is 0.9 
kcal/mol and the values at the 10th and 90th 
percentiles are —1.0 and 3.2. 

The thermodynamic effects of most substitu- 
tions are approximately additive [J^ [2^1 126] . 
meaning that if the stability changes due to 
two different single substitutions are AAG*^ 
and AAG*, then the stability change due to 
both substitutions is approximately AAG" -|- 
AAG''. If we know the probability distribu- 
tion pi (AAG) that a single random substitu- 
tion causes a stability change of AAG, and if we 
assume that substitutions are additive, then the 
net effect AAG™ of m random substitutions 



is just the sum of m random variables from 
the probability distribution pi (AAG). Under 
this additivity assumption, we can therefore 
directly calculate the distribution pm (AAG™) 
for AAG™ by performing an m-fold convolu- 
tion |2Z1 ofpi(AAG). 

The m-neutrality Pf (m) is simply the the 
probability that AAG™ is not more destabi- 
lizing than the extra stability AG^'*'^^ of the 
wildtype sequence, and can be written as 




Pra (AAG™) d( A AG™). 



— oo 

(1) 

This formula gives a protein's m-neutrality in 
terms of its extra stability and the distribution 
of AAG values for all possible single substitu- 
tions. 

Lattice Proteins Support Predictions 

We tested the ability of this simple framework 
to predict the fraction of lattice proteins that 
retained the original structure after random 
amino acid substitutions. Lattice proteins are 
highly simplified models of proteins that pro- 
vide a useful tool for studying protein fold- 
ing [1111211001111] and evolution [111 EH (some 
example lattice proteins are shown in Figure^. 
We can easily measure the m-neutralities of the 
lattice proteins by making random amino acid 
substitutions and seeing if the sequences still 
have AGf < 0.0. We can also use Eq. Q to 
directly predict the m-neutralities since we can 
exactly compute AGf and AAG values. 

Eq. n accurately predicted the m-neutralities 
of all of the lattice proteins we tested. Lat- 
tice proteins with different structures have dif- 
ferent m-neutralities, even when they have 
the same AGf (Figure The 1-neutralities 
of proteins with different structures and the 
same AGf look similar, but for larger values 
of m some proteins clearly show higher m- 
neutralities than others. For large m, the m- 
neutralities of all of the proteins converge to a 
simple exponential of the form 

Pf (m) cx (z^aa)" 
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036036 2 4 6 

AAG number of substitutions 



Figure 1: Lattice proteins with different structures but the same stabihty (AG/ = —1.0) converge 
to different exponential dechnes in m-neutrahty. (a) The distributions of AAG for ah 380 single 
amino acid substitutions to the inset lattice proteins, (b) The measured (symbols) and predicted 
(lines) m-neutralities for the four proteins. Proteins are considered folded if AGj < 0.0 for the 
original native structure. The proteins used for the m-neutrality analyses were generated by adap- 
tive walks from random starting sequences, followed by 2.5 x 10^ generations of neutral evolution 
with a population size of 100 and a per generation per residue substitution rate of 5 x 10~^, select- 
ing for sequences with AG/ < —1.0 and then taking the first sequence generated with a stability 
within 0.025 of -1.0. The m-neutralities were computed by sampling all mutants for m < 2 or 
5 X 10^ random mutants for m > 2. The predicted m-neutralities were computed according to Eq. 
^by numerically convolving the distribution of single-substitution AAG values using generating 
functions [27| computed with fast-Fourier transforms and a bin size of 0.01. 
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where (faa) is the average fraction of proteins 
that are destabihzed by a further single ran- 
dom amino acid substitution after several sub- 
stitutions have already occurred. The under- 
lying reason for the exponential form of this 
decline is clear: after several substitutions the 
distribution of AG/ among the remaining func- 
tional sequences reaches a steady state and 
each new substitution pushes the same frac- 
tion of proteins beyond the stability thresh- 
old. The average neutrality (t'aa) is therefore 
actually the 1-neutrality averaged over all sta- 
ble sequences with the wildtype structure. Al- 
though Pf {m = 1) is similar for all of the pro- 
tein structures in Figure ^ the factors that give 
rise to the different values of (faa) for the dif- 
ferent structures are present in the distribution 
of single mutant A AG values, since it is used 
to predict the m-neutralities for all values of m. 

Figure [2 shows the m-neutralities of proteins 
with the same structure but different stabilities. 
After several substitutions, all of the proteins 
converge to the same value of (z^aa), suggesting 
that (faa) is a generic property of a protein's 
structure. On the other hand, the response of 
a protein to the first few substitutions depends 
strongly on its stability, with more stable pro- 
teins exhibiting higher initial m-neutrality. The 
high initial m-neutrality of stable proteins is 
readily rationalized in terms of the thermody- 
namic model: substitutions tend to disrupt a 
protein's structure by pushing its stability be- 
low the minimal threshold, but proteins with 
an extra stability cushion are buffered against 
the first few substitutions [SSI- Proteins that 
sit on the very margin of the minimal stability 
threshold exhibit lower 1-neutrality than is pre- 
dicted by an exponential decline because these 
proteins are less stable than the average folded 
protein, and so surviving sequences will tend to 
be more stable than the wildtype sequence and 
so be more tolerant to the next substitution. 

Real Proteins Support Predictions 

Our theory makes two main predictions: first, 
that the decline in m-neutrality is determined 
by the AAG values for single amino acid substi- 
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Figure 2: Lattice proteins with the same struc- 
ture but different stabilities have different 1- 
neutralities but have the same average neutral- 
ity ('^aa)- (a) Predicted (lines) and measured 
(symbols) m-neutralities for proteins with dif- 
ferent stabilities and the same structure (III 
in Figure (b) Measured values of the 1- 
neutralities (squares) and average neutralities 
(circles) for proteins with different stabilities 
but the same structures (the plots at left and 
right are for structures I and IV from Figure ^ 
respectively). The sequences were generated by 
finding a sequence with AG/ = —2.0 using the 
procedure described in Figure^ and then using 
this sequence as a starting point for neutral evo- 
lution selecting for the indicated target stabili- 
ties. The proteins with different stabilities are 
highly diverged, with average pairwise sequence 
identities of 15% and 41% for the structures at 
left and right, respectively. The m-neutralities 
were computed as in Figure ^ and (faa) was 
computed as the square root of the 6-neutrality 
divided by the 4-neutrality. 
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TEMl mutant library measurements. 



Round 


(mnt) 


{maa} 


WT 


M182T 



1 
2 
3 
4 
5 


0.0 ± 0.0 
1.3 ± 0.2 
2.6 ± 0.3 
3.9 ± 0.4 
5.2 ± 0.4 
6.5 ± 0.5 


0.0 ± 0.0 
0.9 ± 0.1 
1.8 ± 0.2 
2.7 ± 0.2 
3.6 ± 0.3 
4.5 ± 0.4 


0.76 ± 0.03 
0.59 ± 0.03 
0.47 ± 0.03 
0.28 ± 0.02 
0.18 ± 0.01 
0.13 ± 0.01 


0.74 ± 0.04 
0.68 ± 0.03 
0.54 ± 0.02 
0.45 ± 0.04 
0.28 ± 0.01 
0.20 ± 0.02 



Table 2: Measured fractions of functional proteins in mutant libraries of wildtype and the ther- 
mostable M182T variant of TEMl /5-lactamase. The table shows the number of rounds of error- 
prone PGR, the average number of nucleotide mutations per gene, and the fractions of mutated 
genes that confer ampicillin resistance in E. coli. Values are shown it their standard errors. 



tutions, and second, that among proteins with 
the same structure, more stable variants will 
have higher m-neutralities. We tested these 
predictions against measurements of the frac- 
tions of functional proteins in mutant libraries 
of subtilisin and variants of TEMl /3-lactamase. 
Our theory is designed to predict the fraction of 
proteins that retain the wildtype structure, but 
the experiments measure the fraction of pro- 
teins that retain function. However, since pro- 
teins that fail to fold also generally fail to func- 
tion, our theory provides an upper bound on 
the fraction of functional proteins. We expect 
that for many proteins this upper bound will 
closely approximate the actual fraction func- 
tional since mutagenesis studies suggest that 
most functionally disruptive random substitu- 
tions disrupt the structure rather than specifi- 
cally affect functional residues ^1 1^ ES] . 



To test the ability of our theory to predict 
the decline in m-neutrality, we used data on 
the fractions of functional proteins in subtil- 
isin mutant libraries created by Shafikhani and 
coworkers JU] (population 6B of Table 2 of [TO] , 
normalized by the fraction of functional clones 
in the control libraries) and our own mutant 
libraries of TEMl (Table 2). Each mutant li- 
brary contains a distribution of sequences with 
different numbers of amino acid mutations. The 
form of this distribution is known: the proba- 
bility that a sequence in a library with an aver- 
age of (mnt) nucleotide mutations created by N 
cycles of PGR with a PGR efficiency of A will 



have nint mutations is 



N 



/(m„i) = (l + A)-^^ 



k=0 



k 



I A' 



where x = {mnt) (1 + A) / (NX) [23 EZ]. Sub- 
tilisin was mutagenized using 13 PGR cycles 
with 10 effective doublings jlOj . so is 13 times 
the number of rounds of error-prone PGR and 
A = 0.77. TEMl was mutagenized using 14 
PGR cycles with 10 effective doublings, so is 
14 times the number of rounds and A = 0.71. 
We confirmed that / {nint) accurately describes 
the distribution of mutations in our libraries 
(Figure 5 of the Supporting Information). 

The expected fraction of folded sequences in a 
mutant library is easily calculated from / {mnt) 
and the probability Pj {mnt) that a sequence is 
still functional after mnt nucleotide mutations 



as 



oo 

E 

mnt=o 



f {mnt) X Pf {mnt) ■ 



We calculated the probability Pj {mnt) that 
a sequence was still folded after mnt nucleotide 
mutations by using two existing computer pro- 
grams for estimating the AAG values for sin- 
gle substitutions to proteins with known struc- 
tures (PDB structure IIAV for subtilisin and 
IBTL for TEMl): Gihs and Rooman's PoP- 
MuSiG potential [HH] and Serrano and cowork- 
ers' FOLDEF potential [SHI with van der Waals 
clash energies. Since the genetic code makes nu- 
cleotide mutations more likely to induce some 
amino acid substitutions than others, and since 
error-prone PGR introduces a non-random dis- 
tribution of nucleotide mutations, we weighted 
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each AAG value by the probabihty that it 
would be induced by a single nucleotide mu- 
tation made according to the observed error- 
prone PGR nucleotide mutation frequencies 
(given in Table 1 of JU] for subtilisin and Ta- 
ble 1 of the current work for TEMl). We as- 
signed a AAG of zero to synonymous nucleotide 
mutations since they do not cause an amino 
acid substitution, and we assigned a AAG of 
25 kcal/mol to frameshift and nonsense muta- 
tions since premature truncation is expected to 
inactivate the protein. We ignored the small 
number of substitutions for which PoPMuSiC 
failed to calculated a AAG. With this weighted 
AAG distribution for nucleotide mutations, all 
we needed to construct Pf {rrint) according to 
Eq. m was the value of AGj^*'''^. This can- 
not be measured directly since we do not know 
the minimal stability threshold. However, since 
^(^cxtra Qj^iy influences the initial behavior of 
the m-neutrality and does not affect the limit- 
ing decline (Figure 2), and since we have six 
data points for each protein, we could do a 
least-squares fit of AG^^*'''^ to the data and still 
test the ability of the theory to predict the de- 
cline in the fraction of functional proteins. 

Figure El shows the measured fractions of 
functional proteins for subtilisin and wildtype 
TEMl versus the theoretical predictions made 
with PoPMuSiC and FOLDEF. The theoret- 
ical predictions closely match the measured 
fractions of functional proteins in all cases, 
with subtilisin exhibiting slightly higher m- 
neutralities than TEMl. 

The second major prediction of our theory is 
that among proteins with the same structure, 
more stable variants will exhibit higher initial 
m-neutralities, but converge to same average 
neutrality. To test this prediction, we compared 
the fractions of functional proteins in mutant 
libraries of wildtype and the M182T variant of 
TEMl. The M182T variant differs from wild- 
type by only a single substitution yet is 2.7 
kcal/mol more stable JHl; so we predict that 
it should exhibit a higher fraction of functional 
proteins at the same level of mutation. Figure 
0] shows the measured fractions functional for 
wildtype and the M182T variant, as well as the 
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4 8 12 
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Figure 3: Theoretical predictions and fractions 
of functional proteins in mutant libraries of sub- 
tilisin (dashed lines) and TEMl /3-lactamase 
(solid lines) genes. Thick lines show predictions 
made using PoPMuSiC |3S| and thin lines show 
predictions made using FOLDEF (HHI- The 
TEMl measurements are from Table 2, normal- 
ized by the values from the control unmutated 
library. 




3 6 

library average nucleotide mutations 



Figure 4: The more stable M182T variant of 
TEMl /3-lactamase (dashed lines) exhibits a 
higher fraction of functional mutants relative to 
wildtype (solid lines), as predicted. Thick lines 
show predictions made using PoPMuSiC [HH] 
and thin lines show predictions made using 
FOLDEF [Sni • The measurements are from Ta- 
ble 2, normalized by the values from the control 
unmutated library. 
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theoretical predictions made using both PoP- 
MuSiC and FOLDEF. As predicted, the M182T 
variant exhibits a higher fraction of functional 
proteins, and once again the predictions made 
with both potentials are in good agreement 
with the experimental measurements. 

To further explore the range of possible neu- 
tralities for different proteins, we used AAG 
values from PoPMuSiC to predict the ex- 
pected average neutralities to both amino acid 
substitutions ((t'aa)) and nucleotide mutations 
((z^nt)) for proteins chosen from several differ- 
ent CATH jlU] protein structure classifications. 
Since we do not know AG^^^^ ^or these pro- 
teins, we computed the fraction of proteins ex- 
pected to be inactivated by the 10th mutation 
since after this many mutations effects due to 
the initial protein stability should be small. Ta- 
ble 3 shows the predicted average neutralities to 
both random amino acid substitutions and nu- 
cleotide mutations made according to the mu- 
tation probabilities of our TEMl mutagenesis. 
The predicted average neutralities differ consid- 
erably, showing that our theory predicts that 
different proteins can have substantially differ- 
ent neutralities. 

Discussion 

We have presented a theory for calculating 
the probability that a protein will retain its 
structure after random amino acid substitu- 
tions, and have confirmed the main theoretical 
predictions with simulations and experiments. 
Our theory naturally separates a protein's m- 
neutrality into components due to structure 
and stability. The eventual severity of the ex- 
ponential decline in m-neutrality with the num- 
ber of substitutions is a property of a protein's 
structure. On the other hand, increased stabil- 
ity confers greater tolerance to the first few sub- 
stitutions, in effect allowing a protein to "take a 
few hits" before it is pushed into the inevitable 
structurally determined exponential decline in 
m-neutrality. This increased tolerance to mu- 
tations due to extra stability is probably also 
the underlying reason for the existence of global 
suppressor mutations ^1 E] that buffer pro- 



teins against otherwise deleterious mutations. 

The major assumption underlying our the- 
ory is that the thermodynamic effects of sub- 
stitutions are additive. This assumption is 
clearly not strictly true since protein residues 
do interact. Substitutions are most likely to 
be non-additive if the mutated residues are in 
close contact in a protein's structure j24| I25j . 
Since proteins are large, two randomly chosen 
residues will rarely contact each other, and so 
although the additivity assumption is certainly 
violated for some specific combinations of sub- 
stitutions, it is accurate when averaged over 
all possible substitutions. When we apply our 
theory to measurements of the fraction of mu- 
tant proteins that retain function we are mak- 
ing a second assumption by ignoring the pos- 
sibility that some substitutions may disrupt a 
protein's function in ways other than affecting 
its stability. Therefore, for proteins with a high 
fraction of functional residues, our theory pro- 
vides only an upper bound on the fraction of 
functional proteins. However, our theory's re- 
markable success for both the subtilisin and the 
TEMl mutant libraries suggests that this as- 
sumption is also valid. 

Our theory provides a quantitative ratio- 
nale for earlier work with lattice proteins on 
the organization of functional proteins in se- 
quence space. Bornberg-Bauer and Chan [2] 
proposed that proteins are located in superfun- 
nels in sequence space with the most stable se- 
quence having the most neutral neighbors; oth- 
ers have reported that folded proteins surround 
highly stable prototype sequences in sequence 
space jm 131 E] , and Shakhnovich and cowork- 
ers showed that proteins with a large energy 
gap between the lowest and second lowest en- 
ergy conformations are stabilized against mu- 
tations. We provide a clear explanation: more 
stable proteins are able to tolerate more of the 
possible mutations before unfolding, and so a 
higher fraction of their neighboring sequences 
fold. 

In addition to these stability-based effects, 
different protein structures have different inher- 
ent designabilities, with more sequences fold- 
ing into some structures than others 142| I43j . 
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Predicted Average Neutralities 



PDB 


Protein 


CATH architecture 


Length 






liAV 


subtiiisin 


aP 3-layer sandwich 


269 


0.65 


0.55 


1B9C 


GbV 


fi barrel 


236 


0.62 


0.56 


IB'i'L 


I'EMl /y-iactamase 


aP 3-layer sandwich 


263 


0.58 


0.46 


IRLV 


tRNA endonuclease 


not classihed 


305 


0.55 


0.44 


IH/W 


thymidylate synthase 


aP 2-layer sandwich 


290 


0.50 


0.41 


2BNH 


ribonuclease inhibitor 


a(J horseshoe 


457 


0.45 


0.35 


IHEL 


hen iysozyme 


a orthogonal bundle 


129 


0.43 


0.38 



Table 3: Predictions of the average neutralities of various proteins to both nucleotide mutations 
((j^nt)) and amino acid substitutions ((faa))- The codes for the PDB structures and the CATH [lU] 
architectures are shown along with the lengths of the protein chains in the PDB structures (in all 
cases we consider chain A). The average neutralities are computed by calculating the fraction of 
sequences predicted be inactivated by the 10th mutation or substitution, using AAG values from 
PoPMuSiC IHSI and assuming that the proteins all have the same value of AG^^^^^ as wildtype 
TEMl /3-lactamase. The values of (I'nt) are computed assuming that nucleotide mutations are 
made according to the error-prone PGR mutation frequencies of Table 1. 



Proteins with more designable structures might 
be expected to show a higher average neutral- 
ity since their structures occupy a larger frac- 
tion of sequence space. The average neutrality 
(t'aa) therefore provides a quantitative measure 
of designability that can be estimated with cur- 
rent computational techniques. 

Our work suggests a more nuanced approach 
to experimentally analyzing protein neutralities 
than has been applied in the past. Loeb and 
coworkers have performed a careful analysis 
of the neutralities of several proteins or regions 
of proteins under the assumption of a strict 
exponential decline in m-neutrality. However, 
our work suggests that a protein's m-neutrality 
can deviate from a strict exponential for the 
first few substitutions if the protein has a large 
amount of extra stability, as we show for the 
M182T variant of TEMl. Experimental mu- 
tagenesis studies suggest that during natural 
evolution, proteins accumulate mildly destabi- 
lizing mutations that are counterbalanced by 
stabilizing mutations [^S]- We suggest that it 
is also important to examine whether some nat- 
ural proteins have systematically accumulated 
stabilizing mutations in order to provide them 
with additional robustness JS] to amino acid 
substitutions. 

Our work also has applications in protein en- 
gineering. Directed evolution involves screen- 



ing libraries of mutant proteins for new or im- 
proved functions Each round of directed 
evolution typically introduces only one or two 
amino acid substitutions because the rapid de- 
cline in m-neutrality means that higher muta- 
tion rates will yield libraries of mostly unfolded 
proteins. Our work suggests that using highly 
stable parents for directed evolution should in- 
crease the fraction of folded mutants at a given 
level of substitutions. It also provides a method 
for predicting which structures will better tol- 
erate large numbers of substitutions. 
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Mutations in the round five TEMl libraries 



Clone 


TEMl WT 


TEMl M182T 


1 


G43A (A15T), T64C (F22L). 
C68A (A23D), A90G. G154A (G52S), 
A375C, T407C (L136P), A426G, 
T479G (L160R) 


G122T (R41L), A158T (K53M), 
A185C (E62A), G198A (M66I), 
C204T, T246C, T408C 


2 


T52C (C18R), A71C (H24P), 
A149G (N50S), T200A (M67K), 
A319C (T107P), T332A (LUIH), 
A388T (N130Y), T393C 


A94G (K32E), T161A (I54N), 

A195T, G385A (D129N), A430T (STOP), 

A467G (H156R), T470A (V157E) 


3 


C28A (LIOI), dclT30. T38C (F13S). 
T50C (F17S), A123T, A151G (S51G), 
T190C (F64L), G327C (K1G9N) 


T49C (F17L), T50C (F17S), 
A120G, delG128, C249T, 
T278C (I93T) 


4 


T42C, T200G (M67R). T384A (S128R), 
A423G, G459T (M153I), A503C (N168T), 
delT504 


T33A, T130C (Y44H), A264G, 
G268A (G90S), C466T (H156Y), 
C545T (A182V) 


5 


A77G (E26G), GIOOA (A34T). 

A149G (N50S), C177T, T236A (STOP), 

A418T (I140F) 


A90G, T223A (G75S), G241T (R81G), 
A263G (Q88R), T339A (D113E), 
T410A (L137Q) 


6 


T171G (S57R), A186G. T317G (V106A). 
T465C, A518G (N173S), A549G 


T161C (I54T), T243C, A364G (S122G). 
C427T (P143S). A550G (M184V) 


7 


C27T, T52A (C18S), G67A (A23T), 
T216C, T374C (I125T), A491G (E164G) 


T69A, T174C, A195G, 

G488A (STOP), A559G (T187A) 


8 


T39C, T40C (F14L), T51A (F17L), 
T361A (C121S), A472G (T158A), 
G476A (R159H) 


T41C (F14S), T302A (VIOID), 
A467G (H156R), G476A (R159H), 
A482G (D161G) 


9 


A158T (K53M), A186G, G306A, 
T384G (S128R), T539G (M180R) 


T144C, C276T, A281G (H94R), 
A513G (I171M) 


10 


T41A (F14Y), T86A (V29E). 
A464G (D155G), A570G 


A148T (N50Y), C153T, A199G (M67V), 
C378T 


11 


dclA7, T245A (V82D), T332C (LI IIP) 


C70T (H24Y), T245G (V82G), 
T286A (S96T), A357G 


12 


A95G (K32R), A308T (Y103F), 
G525A 


T33G, G147T, A348G, 
T404C (L135S) 


13 


T447G (F149L), A452G (H151R). 
A513G (I171M) 


A95G (K32R), T225A (STOP). 
G383A (S128N), C416T (T139M) 


14 


A321G, A524G (E175G), A538T (M180L) 


A258C 


Q86H), G267T, A315T 


15 


T112C, A167G (E56G), G566A (R189H) 


T173A 


F58Y), T393G, G539T (T180M) 


16 


T112C, A158G (K53R) 


G340A 


(G114S), C436T, A454G (N152D) 


17 


ClOlG (A34G), T384A (S128R) 


A157T 
C440T ( 


STOP), T403G (L135V), 
T147I) 


18 


A158G (K53R), T233A (V78E) 


delT42, T380A (M127K) 


19 


C28T (LIOF), A516C 


G168T 


E56D), C187T (R63C) 


20 




A352T 


STOP) 



Table 4: The mutations found in the 20 unselected clones from the round five wildtype and M182T 
TEMl /3-lactamase Hbraries. For nonsynonymous mutations, the amino acid change is shown in 
parentheses. 
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Figure 5: The observed distribution of mutations among the 40 sequenced unselected TEMl /3- 
lactamase clones from the round-five mutant hbraries (bars) agrees with the theoretical predictions 
(lines) for the distribution of mutations in an error-prone PGR library. The theoretical predictions 
are made with the equation / (m^t) described in the text, with an average number of nucleotide 
mutations of {rant) = 4.3 for the 570 bp sequenced region of the 861 bp gene. A chi-square test 
demonstrated that the observed distribution is consistent with the theoretical predictions, with a 
P- value > 0.9 that a difference at least this large between the observed and predicted values would 
occur by chance. 
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